Abstract. A recurrent neural network with noisy input is studied analytically, on the basis of
a Discrete Time Master Equation. The latter is derived from a biologically realizable learning 
rule for the weights of the connections. In a numerical study it is found that the fixed points of 
the dynamics of the net are time dependent, implying that the representation in the brain of a 
fixed piece of information (e.g., a word to be recognized) is not fixed in time. 
1. Introduction

It is our purpose to construct and describe a neural net that can learn and retrieve 
patterns in a way that is biologically realizable. In an actual situation, the input to 
a network is noisy: the brain is confronted with patterns that are similar, but not 
identical. Therefore, we are going to model the training phase of a neural network by 
considering $p$ sets, $\Omega^\mu$, of similar patterns $x$, centered around $p$ typical patterns, $\xi^\mu$
($\mu = 1, \ldots, p$).

In the existing literature, learning rules are used which are based on typical
patterns and not on sets of similar patterns $\Omega^\mu$. This is biologically unrealistic:
a child does not learn 'standard words', pronounced by a 'standard speaker', but
hears the same word pronounced by different speakers in different ways, i.e., the child
is exposed to sets $\Omega^\mu$. In order to model biologically realistic learning, we use a
learning rule which contains patterns $x$ belonging to learning sets $\Omega^\mu$ rather than the
patterns $\xi^\mu$ alone.

We show that when this learning rule, based on noisy input patterns, is used,
the network evolves to values for the strengths of the synaptic connections, usually
called 'weights', that fluctuate with respect to certain fixed asymptotic values. For
an actual brain this corresponds to the fact that the confrontation with input data
leads to synaptic connections that change in strength throughout their lifetime,
but in such a way that there is stability in what the brain stores and recollects. When a
biological neural network gets as input a pattern that it learned a long time ago,
and which, via the tuning of the synapses, has found a firm and fixed place
on the substrate formed by the neural tissue, it nevertheless changes the synapses.
This is not necessary, of course, but cannot be circumvented in an actual biological
network. Input always changes the connections, since there is no way for an individual
neuron to know whether or not a pattern has been encountered earlier. All this means
that the learning rule must be such that the learning speed is not too large Q: new 
information might otherwise destroy too much of the old information and, hence, the 
network's functioning. 

So far, we have been considering the situation in which patterns are presented to the
network and, through some learning rule, synaptic connections are changed.
A next point to discuss is how an actual, biological, neural net, while changing its
connections continuously, can nevertheless recognize a pattern. In other words, what
is the difference, for a neural net, between an old, i.e., already stored pattern, and a
new, unknown pattern? As noted before, an old pattern, too, is 'stored', in the sense
that it gives rise to a change of the connections. The answer to the question is that,
although each individual neuron and each synapse reacts independently of
whether a pattern has been stored or not, the net as a whole reacts differently in these
two cases: see section ||, in particular figure |. 

Neural nets have been studied with fixed 1 1, |5| , and with adapting — or dynamic — 
synapses f|, The latter neural nets are also called nets with double dynamics 
. In the context of spin glasses one speaks of coupled dynamics . In the last part 
of this article, we will be concerned with neural networks with double dynamics: see
section 3.5.

We study what happens in a net with ever changing connections, by comparing 
what happens when a pattern is presented to the net that has been learned before to 
what happens when this is not the case. These two cases are investigated numerically 
on the basis of a particular learning rule, for which we have chosen the one we derived 
earlier ||: it is a Mixed Hebbian-Anti-Hebbian, Hopfield like, learning rule, which 
is non-symmetric with respect to post- and pre-synaptic input, and which contains, 
moreover, a post-synaptic potential dependent factor. We found this rule assuming 
that building and destroying of a synapse costs biochemical energy, and by requiring, 
at the same time, that the energy needed to change a neural network be minimal. 
We suppose that the patterns $x \in \Omega^\mu$ that are presented to the net (the various ways
in which one and the same 'word' is presented) are chosen randomly from a set of
patterns distributed around a set of $p$ typical patterns $\xi^\mu$ (the $p$ 'standard words' to
be learned).

Random processes can often be described in a useful way via a so-called Master 
Equation for the relevant random variables [10]. We therefore start, in section 2,
by deriving a Discrete Time Master Equation for the random variables in question,
namely the weights $w_{ij}$ of the connections of the network. Usually, a Master Equation
is solved going from discrete time to continuous time, which always entails some
essential difficulties [11], [12], [13]. Such a transition to a process that is continuous
in time is often advantageous, since a differential equation, in general, is easier to 
solve than a difference equation. In our approach, the transition to the differential 
equation could be circumvented, since we had in this case at our disposal a tool that 
turned out to enable us to directly solve the difference equation itself: the Gauss-Seidel 
iterative method. 

A question one might raise is whether a system with ever changing connections 
will ever achieve some kind of stationary state. A numerical study cannot easily
answer this question, since the fluctuations of the weights are quite wild (see figure 2).
We therefore performed an analytic study, based on the particular learning rule used
throughout our work. We found that the system's weights will fluctuate around certain
asymptotic values, and that the last stored pattern that has given rise to a fixed point
is roaming over $\Omega^\mu$, the collection of patterns around a typical pattern $\xi^\mu$. All this
can be rephrased by stating that both the neural net itself and its particular states, 
the fixed points, wiggle around average values: the ever changing mind is, in some 
sense, stable. 

The above can be summarized as follows. In section 2 we derive the Discrete
Time Master Equation for the weights of a neural network, from a learning rule, and
solve this equation analytically. In section 3 we have two objectives: firstly, to check
numerically the analytical result of section 2, and, secondly, to study the implications
of double dynamics.



2. The weights of a network trained with noisy patterns 

In a preceding article we derived what we have called an 'energy saving learning rule'. 
When at time $t_n$ ($n = 0, 1, \ldots$) the weights are given by $w_{ij}(t_n)$, and if, thereupon,
a pattern $\xi = (\xi_1, \ldots, \xi_N)$ is presented to a net of $N$ neurons, then the weights are
changed according to the rule

$$ w_{ij}(t_{n+1}) = w_{ij}(t_n) + \Delta w_{ij}(t_n) \qquad (j \in V_i) \tag{1} $$

where

$$ \Delta w_{ij}(t_n) = \eta_i [\kappa - \gamma_i(\xi, w_i(t_n))] (2\xi_i - 1) \xi_j \qquad (j \in V_i) \tag{2} $$

(see ||, equations (41) and (42)). The index $n$ in $t_n$ labels subsequent moments of
the net: $t_0$ is the initial time, where the weights have their initial values, $w_{ij}(t_0)$. In
these equations $V_i$ is the collection of indices $j$ with which neuron $i$ is connected via
adaptable, non-zero synapses. Furthermore, for the so-called stability coefficients we
used the abbreviation

$$ \gamma_i(x, w_i(t)) := \Big( \sum_{l=1}^{N} w_{il}(t) x_l - \theta_i \Big) (2x_i - 1) \qquad (i = 1, \ldots, N) \tag{3} $$

where $w_i := (w_{i1}, \ldots, w_{iN})$. The quantity $\gamma_i$ arises naturally in case the dynamics of the
network is taken to be given by equation (|5|) below; see, e.g., ||. In the learning rule
(2) two quantities occur, $\eta_i$ and $\kappa$. For a non-biological system they can be expressed
in terms of properties of the neural net and as a function of the patterns $\xi^\mu$ to be
stored in the net: for $\eta_i$, the so-called learning rate, see equations (42) and (65) of
the preceding article; for $\kappa$, the so-called margin parameter, see [14]. For biological
systems, the coefficients $\eta_i$ and $\kappa$ are replaced by suitable constants: for $\eta_i$ see ||,
section 6; for $\kappa$ see Q, section 3.

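It was shown that a repeated application ($n \to \infty$) of the rule (2) gave rise to the
following expression for the weights at some (finite or infinite) time $t_\infty$:

$$ w_{ij}(t_\infty) = \begin{cases} w_{ij}(t_0) + N^{-1} \sum_{\mu,\nu=1}^{p} [\kappa - \gamma_i^\mu(t_0)] (2\xi_i^\mu - 1) (C_i^{-1})^{\mu\nu} \xi_j^\nu & (j \in V_i) \\ w_{ij}(t_0) & (j \in V_i^c) \end{cases} \tag{4} $$
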
where $\gamma_i^\mu(t_0) := \gamma_i(\xi^\mu, w_i(t_0))$ and where $C_i^{\mu\nu}$ is the reduced correlation matrix,
defined by

$$ C_i^{\mu\nu} := N^{-1} \sum_{k \in V_i} \xi_k^\mu \xi_k^\nu \tag{5} $$

(see [8] equation (52)). This type of learning, and the ensuing expression for the
weights, $w_{ij}(t_\infty)$, correspond to the idealized situation of ideal, i.e., unperturbed,
input.

2.1. The energy saving learning rule for learning with noise 

In a realistic situation, however, the repeated training of a network will not take place
with patterns $\xi^\mu$ which remain exactly the same throughout the training process.
Each time a certain pattern $\xi^\mu$ is presented to the network, it may be slightly different.
Therefore, rather than studying what happens when patterns $\xi^\mu$ are presented, we will
study what happens when patterns $x$ are learned, which belong to sets of patterns
$\Omega^\mu$ clustered around typical patterns $\xi^\mu$ ($\mu = 1, \ldots, p$). In other words, we allow the
patterns to be alike, but not necessarily exactly equal, to one of the $p$ representative
patterns: we allow for what is called, technically, 'noise'.

In a preceding article, we studied patterns $x$, belonging to sets of patterns $\Omega^\mu$
clustered around typical patterns $\xi^\mu$, in the context of basins of attraction [14]. In
other words, we introduced $\Omega^\mu$ as a means to construct, by hand, basins of attraction.
In the present article, we use $\Omega^\mu$ to represent noisy patterns. Thus the sets $\Omega^\mu$
appearing in the two articles have the same meaning, namely sets of patterns around
a typical pattern, but the reasons for introducing them are different: in the preceding
article there was a mathematical motivation, whereas in the current article it is
motivated by the biological reality that patterns are never exactly equal.

The purpose of this article is to determine the values for the weights $w_{ij}$ in case of
learning with noisy patterns. We start by simply conjecturing that for noisy patterns
the old rule can essentially be maintained: all we do is replace, in (2), the
$\xi$'s by $x$'s. Hence, we take as learning rule

$$ \Delta w_{ij}(x, t_n) = \eta_i [\kappa - \gamma_i(x, w_i(t_n))] (2x_i - 1) x_j \qquad (j \in V_i). \tag{6} $$

The learning rate $\eta_i$ figuring in this expression will be discussed in section 3.4; the
margin parameter $\kappa$ has been discussed in a preceding article [14]. We will prove that
this learning rule leads, on the average, to suitable values for the weights. From this
we conclude that the energy saving learning rule is suitable also for learning the right
patterns $\xi^\mu$, on the basis of wrong (i.e. perturbed) input patterns $x$ of $\Omega^\mu$.
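
To make the action of the rule (6) on a single neuron concrete, the following minimal Python sketch implements the stability coefficient (3) and the increment (6). The function names `stability` and `delta_w`, the boolean array `mask_i` that encodes the set $V_i$, and the use of NumPy are assumptions introduced here for illustration; they are not part of the original derivation.

```python
import numpy as np

def stability(x, w_i, theta_i, i):
    """Stability coefficient gamma_i(x, w_i) of equation (3)."""
    return (w_i @ x - theta_i) * (2 * x[i] - 1)

def delta_w(x, w_i, i, eta_i, kappa, theta_i, mask_i):
    """Increment of the weights of neuron i, equation (6).

    mask_i[j] is True when j belongs to V_i (hypothetical encoding of V_i);
    weights outside V_i receive a zero increment.
    """
    gamma = stability(x, w_i, theta_i, i)
    return eta_i * (kappa - gamma) * (2 * x[i] - 1) * x * mask_i

# Example: neuron 0 of a 4-neuron net, tabula rasa, no self-connection.
x = np.array([1.0, 0.0, 1.0, 1.0])
mask_0 = np.array([False, True, True, True])
dw = delta_w(x, np.zeros(4), 0, eta_i=0.5, kappa=1.0, theta_i=0.0, mask_i=mask_0)
```

Only the synapses coming from active neurons ($x_j = 1$) inside $V_i$ are changed, and the sign of the change is set by the state $x_i$ of the post-synaptic neuron.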

2.2. The Discrete Time Master Equation 

When learning takes place in a biological neural network with a learning rule of the
type (6), we assume that the changes of the weights at time $t_{n+1}$ depend only on
the values of the weights $w_i$ at time $t_n$ and the variable $x = (x_1, \ldots, x_N)$ which is
randomly drawn, at time $t_n$, out of the collection $\Omega = \cup_\mu \Omega^\mu$, the union of disjoint sets of
patterns $x$ centered around typical patterns $\xi^\mu$. Consequently, learning
with a learning rule like (6) is a Markovian process, and the weights $w_{ij}$ are stochastic
variables. Thus we have for the new weights $w'_{ij}$:

$$ w'_{ij} = \begin{cases} w_{ij} + \Delta w_{ij}(x, w_i) & (j \in V_i) \\ w_{ij} & (j \in V_i^c) \end{cases} \tag{7} $$

where the $\Delta w_{ij}(x, w_i)$ are the increments given by the learning rule (6),

$$ \Delta w_{ij}(x, w_i) = \eta_i [\kappa - \gamma_i(x, w_i)] (2x_i - 1) x_j \qquad (j \in V_i). \tag{8} $$

Let $T_{ij}(w'_{ij} | w_{ij})$ be the probability density that a transition takes place from the
value $w_{ij}$ to the value $w'_{ij}$. Then we have

$$ T_{ij}(w'_{ij} | w_{ij}) = \sum_{x \in \Omega} p(x) \, \delta(w'_{ij} - w_{ij} - \Delta w_{ij}(x, w_i)) \tag{9} $$

where $p(x)$ is the probability to draw $x$ from $\Omega$. The $\delta$-function guarantees that
only transitions take place which obey the learning rule (8). Using a probability $p(x)$
normalized to unity, i.e.,

$$ \sum_{x \in \Omega} p(x) = 1 \tag{10} $$

we find from (9) the following total transition probability from a state $w_{ij}$:

$$ \int dw'_{ij} \, T_{ij}(w'_{ij} | w_{ij}) = 1. \tag{11} $$

Let $P_{ij}(w_{ij}, t_{n+1})$ be the probability of occurrence of the variable $w_{ij}$ at $t_{n+1}$. Then,
the probability $P_{ij}$ and the transition probability $T_{ij}$ are related according to

$$ P_{ij}(w_{ij}, t_{n+1}) = \int dw'_{ij} \, T_{ij}(w_{ij} | w'_{ij}) P_{ij}(w'_{ij}, t_n) \qquad (i = 1, \ldots, N; \; j \in V_i). \tag{12} $$

Let, moreover, the probability $P_{ij}$ be normalized according to

$$ \int dw_{ij} \, P_{ij}(w_{ij}, t_n) = 1. \tag{13} $$

From (11) and (12) it follows that

$$ P_{ij}(w_{ij}, t_{n+1}) - P_{ij}(w_{ij}, t_n) = \int dw'_{ij} \, [T_{ij}(w_{ij} | w'_{ij}) P_{ij}(w'_{ij}, t_n) - T_{ij}(w'_{ij} | w_{ij}) P_{ij}(w_{ij}, t_n)] \qquad (i = 1, \ldots, N; \; j \in V_i) \tag{14} $$

which is the Discrete Time Master Equation for the weights $w_{ij}$.
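Next, let us consider the average of the weights at $t_n$:

$$ \langle w_{ij} \rangle_{t_n} := \int dw_{ij} \, P_{ij}(w_{ij}, t_n) \, w_{ij} \qquad (j \in V_i) \tag{15} $$

or, using the normalization (13),

$$ \langle w_{ij} \rangle_{t_n} = \prod_{k=1}^{N} \int dw_{ik} \, P_{ik}(w_{ik}, t_n) \, w_{ij} \qquad (j \in V_i). \tag{16} $$

Using the Master Equation (14), we obtain for the change of the weights

$$ \langle w_{ij} \rangle_{t_{n+1}} - \langle w_{ij} \rangle_{t_n} = \prod_{k=1}^{N} \int dw_{ik} \, dw'_{ik} \, [T_{ik}(w_{ik} | w'_{ik}) P_{ik}(w'_{ik}, t_n) - T_{ik}(w'_{ik} | w_{ik}) P_{ik}(w_{ik}, t_n)] \, w_{ij}. \tag{17} $$

Interchanging the primed and unprimed variables in the first term we find

$$ \langle w_{ij} \rangle_{t_{n+1}} - \langle w_{ij} \rangle_{t_n} = \prod_{k=1}^{N} \int dw_{ik} \, dw'_{ik} \, (w'_{ij} - w_{ij}) \, T_{ik}(w'_{ik} | w_{ik}) P_{ik}(w_{ik}, t_n) \tag{18} $$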
or, with (9),

$$ \langle w_{ij} \rangle_{t_{n+1}} - \langle w_{ij} \rangle_{t_n} = \sum_{x \in \Omega} p(x) \prod_{k=1}^{N} \int dw_{ik} \, \Delta w_{ij}(x, w_i) \, P_{ik}(w_{ik}, t_n) \qquad (j \in V_i). \tag{19} $$

Inserting (8), and using (13) and (15), we find a first order difference equation for the
variable $\langle w_{ij} \rangle_{t_n}$

$$ \langle w_{ij} \rangle_{t_{n+1}} - \langle w_{ij} \rangle_{t_n} = \sum_{x \in \Omega} p(x) \, \eta_i [\kappa - \gamma_i(x, \langle w_i \rangle_{t_n})] (2x_i - 1) x_j \qquad (j \in V_i) \tag{20} $$

which can be solved once more is known about the probability $p(x)$.

Let $p^\mu(x)$ be the probability to draw $x$ from the disjoint collections $\Omega^\mu$,
normalized according to

$$ \sum_{x \in \Omega^\mu} p^\mu(x) = 1. \tag{21} $$

Then the probability to draw $x$ from $\Omega = \cup_\mu \Omega^\mu$ is

$$ p(x) = \frac{1}{p} \, p^\mu(x) \qquad (x \in \Omega^\mu) \tag{22} $$

in agreement with the normalization (10), as may be verified with (21) and (22).

Let $p_i^\mu(x_i)$ be the probability that in the collection $\Omega^\mu$ the $i$-th component of $x$
has the value $x_i$. Then the probability $p^\mu(x)$ to draw from the collection $\Omega^\mu$ the
vector $x$ is given by

$$ p^\mu(x) = \prod_{i=1}^{N} p_i^\mu(x_i). \tag{23} $$

By choosing the normalization according to

$$ \sum_{x_i = 0, 1} p_i^\mu(x_i) = 1 \tag{24} $$

we find that (21) is satisfied.

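We now introduce the average with respect to the set $\Omega^\mu$:

$$ \bar{x}_i^\mu := \sum_{x \in \Omega^\mu} p^\mu(x) \, x_i. \tag{25} $$

In view of (22), (23), (24) and (25) we have

$$ \sum_{x \in \Omega} p(x) \, x_i = \frac{1}{p} \sum_{\mu=1}^{p} \bar{x}_i^\mu \tag{26} $$

$$ \sum_{x \in \Omega} p(x) \, x_i x_j = \frac{1}{p} \sum_{\mu=1}^{p} \bar{x}_i^\mu \bar{x}_j^\mu. \tag{27} $$

With these relations we may rewrite the difference equation for the average weights,
equation (20), in the form

$$ \langle w_{ij} \rangle_{t_{n+1}} - \langle w_{ij} \rangle_{t_n} = \frac{1}{p} \sum_{\mu=1}^{p} \eta_i [\kappa - \gamma_i(\bar{x}^\mu, \langle w_i \rangle_{t_n})] (2\bar{x}_i^\mu - 1) \bar{x}_j^\mu \qquad (j \in V_i). \tag{28} $$

This equation will be solved, in the next section, in the limit of large $n$. Note that the
$\gamma_i$'s now contain the averages $\bar{x}^\mu$.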
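The equation (28) for the average value of the weights $w_{ij}$ at time $t_n$ can be solved,
for $n \to \infty$, in a way that closely parallels the method of Diederich and Opper [15].
First, using (28) recursively, we arrive at

$$ \langle w_{ij} \rangle_{t_{n+1}} = \langle w_{ij} \rangle_{t_0} + N^{-1} \sum_{\mu=1}^{p} F_i^\mu(t_n) \, \bar{x}_j^\mu \qquad (i = 1, \ldots, N; \; j \in V_i) \tag{29} $$

where

$$ F_i^\mu(t_n) = \frac{\eta_i}{\alpha} \sum_{m=0}^{n} [\kappa - \gamma_i(\bar{x}^\mu, \langle w_i \rangle_{t_m})] (2\bar{x}_i^\mu - 1) \tag{30} $$

with $\alpha = p/N$. From (30) it follows that

$$ F_i^\mu(t_n) - F_i^\mu(t_{n-1}) = \frac{\eta_i}{\alpha} [\kappa - \gamma_i(\bar{x}^\mu, \langle w_i \rangle_{t_n})] (2\bar{x}_i^\mu - 1) \tag{31} $$

or, using (3),

$$ F_i^\mu(t_n) - F_i^\mu(t_{n-1}) = \frac{\eta_i}{\alpha} \Big[ \kappa (2\bar{x}_i^\mu - 1) - \Big( \sum_{k=1}^{N} \langle w_{ik} \rangle_{t_n} \bar{x}_k^\mu - \theta_i \Big) \Big]. \tag{32} $$

Eliminating $\langle w_{ik} \rangle_{t_n}$ from (32) via (29) yields

$$ \frac{\alpha}{\eta_i} \big( F_i^\mu(t_n) - F_i^\mu(t_{n-1}) \big) = \Big[ \kappa (2\bar{x}_i^\mu - 1) - \Big( \sum_{k=1}^{N} \langle w_{ik} \rangle_{t_0} \bar{x}_k^\mu - \theta_i \Big) \Big] - N^{-1} \sum_{k \in V_i} \sum_{\nu=1}^{p} \bar{x}_k^\mu \bar{x}_k^\nu F_i^\nu(t_{n-1}). \tag{33} $$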

To solve this set of linear equations, we shall rewrite them in matrix notation. First
of all, let us introduce a $p \times p$ matrix, $C_i$, with matrix-elements given by

$$ C_i^{\mu\nu} := N^{-1} \sum_{k \in V_i} \bar{x}_k^\mu \bar{x}_k^\nu. \tag{34} $$

We will refer to this matrix as the 'correlation matrix of averages'. The connection
with a usual correlation matrix becomes more apparent in case $\bar{x}_k^\mu$ can be replaced by
$\xi_k^\mu$. Then the 'correlation matrix of averages' is identical to the 'reduced correlation
matrix' (5). We also introduce a $p \times p$ diagonal matrix $H_i$ with diagonal elements given
by $H_i^{\mu\mu} = \alpha/\eta_i$. Finally we shall denote a $p \times p$ unit matrix as $I$. Apart from the
above mentioned matrices, we introduce the vectors $F_i(t_n) := (F_i^1(t_n), \ldots, F_i^p(t_n))$
and $G_i := (G_i^1, \ldots, G_i^p)$ with components
$G_i^\mu = [\kappa (2\bar{x}_i^\mu - 1) - (\sum_{k=1}^{N} \langle w_{ik} \rangle_{t_0} \bar{x}_k^\mu - \theta_i)]$.
With these notations and abbreviations (33) can be recast in the simple form

$$ H_i \cdot F_i(t_n) = (H_i - C_i) \cdot F_i(t_{n-1}) + G_i. \tag{35} $$

Solving this equation iteratively for $F_i(t_n)$ we obtain

$$ F_i(t_n) = [H_i^{-1} \cdot (H_i - C_i)]^n \cdot F_i(t_0) + H_i^{-1} \cdot \big[ I + H_i^{-1} \cdot (H_i - C_i) + \ldots + [H_i^{-1} \cdot (H_i - C_i)]^{n-1} \big] \cdot G_i. \tag{36} $$

The matrix $C_i$, as defined in (34), is easily seen to be positive definite and symmetric.
It then can be shown that the matrix $H_i^{-1} \cdot (H_i - C_i)$ has eigenvalues smaller than
one jn| . As a consequence, we have

$$ \lim_{n \to \infty} [H_i^{-1} \cdot (H_i - C_i)]^n = 0. \tag{37} $$
This implies that, in the limit $n \to \infty$, (36) converges to

$$ F_i(t_\infty) = H_i^{-1} \cdot [I - H_i^{-1} \cdot (H_i - C_i)]^{-1} \cdot G_i = C_i^{-1} \cdot G_i. \tag{38} $$
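
The convergence of the iteration (35) to the limit (38) can be checked numerically. The sketch below is only an illustration under stated assumptions: the averages `xbar` and the vector `G` are drawn at random (they are hypothetical, not taken from the article), $V_i$ is taken to contain all neurons, and the learning rate is chosen as `eta = alpha / lam_max`, an assumed choice that places the eigenvalues of $H_i^{-1} \cdot (H_i - C_i)$ in the interval $[0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 64, 4
alpha = p / N

# Hypothetical averages xbar[mu, k], strictly between 0 and 1.
xbar = rng.uniform(0.05, 0.95, size=(p, N))
C = xbar @ xbar.T / N          # 'correlation matrix of averages', eq. (34), V_i = all k
G = rng.normal(size=p)         # arbitrary right-hand side standing in for G_i

lam_max = np.linalg.eigvalsh(C).max()
eta = alpha / lam_max          # assumed learning rate, keeps the iteration contractive
H = (alpha / eta) * np.eye(p)  # diagonal matrix H_i with elements alpha / eta_i

# Spectral radius of the iteration matrix H^{-1} (H - C).
rho = np.abs(np.linalg.eigvals(np.linalg.solve(H, H - C))).max()

# Iterate eq. (35): H F(t_n) = (H - C) F(t_{n-1}) + G, starting from F = 0.
F = np.zeros(p)
for _ in range(20000):
    F = np.linalg.solve(H, (H - C) @ F + G)

F_limit = np.linalg.solve(C, G)  # eq. (38): F(t_infinity) = C^{-1} G
```

Since $H_i$ is a multiple of the unit matrix, the limit does not depend on the value of the learning rate, as long as the iteration converges.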

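Substitution of (38), in index-notation, into (29) yields the following result for the
average weights $\langle w_{ij} \rangle_{t_\infty}$ ($j \in V_i$) after a learning process with noisy patterns

$$ \langle w_{ij} \rangle_{t_\infty} = \langle w_{ij} \rangle_{t_0} + v_{ij} \qquad (j \in V_i) \tag{39} $$

where

$$ v_{ij} = N^{-1} \sum_{\mu,\nu=1}^{p} [\kappa - \gamma_i(\bar{x}^\mu, \langle w_i \rangle_{t_0})] (2\bar{x}_i^\mu - 1) (C_i^{-1})^{\mu\nu} \bar{x}_j^\nu. \tag{40} $$
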

Hence, although the weights $w_{ij}(t_n)$ themselves do not converge, since the increments
$\Delta w_{ij}(t_n)$, given by eq. (6), do not tend to zero for $n$ going to infinity, the average
$\langle w_{ij}(t_n) \rangle$ does. We thus find that the actual, biological weights $w_{ij}(t_n)$ of the neural
net fluctuate around the average value $\langle w_{ij} \rangle_{t_\infty}$ (see figure 2). The expression (39)-(40)
for the average weights of a network trained with noisy patterns constitutes the main
analytical result of this article.

In the following section we carry out a numerical analysis of the process of
learning patterns that are perturbed. We will use a particular expression for the
probability distribution $p^\mu(x)$, namely, eq. (43), which is the same as the one used
in a preceding article [14], in order to compare the biological result (39)-(40) for the
weights and the result for the weights obtained in case of a mathematical approach
aiming at creating fixed points with prescribed basins of attraction.



3. Numerical analysis 

On the basis of the result found above, in particular equations (39)-(40), we expect
that the energy saving learning rule (6) applied to a set of noisy input patterns $x \in \Omega^\mu$
will lead to satisfactory results. By satisfactory we mean that the system can recognize
patterns belonging to the clusters of patterns $\Omega^\mu$. In more detail, we mean that,
after a certain number of learning steps, each cluster $\Omega^\mu$ has a fixed point, $y^\mu$ say,
whereas the other patterns of $\Omega^\mu$ belong to the basin of attraction of this fixed point
$y^\mu$. It should be noted that when a new pattern $z^\mu \in \Omega^\mu$ is learned, the old fixed
point $y^\mu \in \Omega^\mu$ can be replaced by a new fixed point $z^\mu$. This is a direct consequence
of the learning rule with a pre-factor given by (48), which is such that the last
learned pattern becomes automatically, and in one learning step only, a fixed point
of the dynamics, see the preceding article [8]. In this preceding article, we considered
learning of $p$ patterns $\xi^\mu$ ($\mu = 1, \ldots, p$). In the language of the present article, we can
say that we studied, in the preceding article, the learning of 'clusters' $\Omega^\mu$ ($\mu = 1, \ldots, p$),
each consisting of one single pattern, $y^\mu = \xi^\mu$. Since there was only one pattern per
cluster $\Omega^\mu$, the fixed point remained the same during all of the learning process. In
this article, the situation is a little bit different: the fixed point $y^\mu$ of a cluster is
'roaming' over $\Omega^\mu$, i.e., the fixed point is no longer fixed during all of the learning
process.
3.1. A measure for the performance of a neural net

A criterion for the way in which a neural net functions may be based on the stability
coefficients $\gamma_i$. The more of them are positive, for given sets of typical patterns $\xi^\mu$,
the better the net fulfils its task of storing and recollecting patterns ||. Inserting
the final expression for the weights, eqs. (39)-(40), into the definition of the stability
coefficients $\gamma_i$, eq. (3), it follows that at $t_\infty$ they are given by

$$ \gamma_i(\bar{x}^\mu, \langle w_i \rangle_{t_\infty}) = \kappa \qquad (i = 1, \ldots, N; \; \mu = 1, \ldots, p). \tag{41} $$

In view of (27), the $\gamma$-function of the average $\bar{x}^\mu$ can be replaced by the average of
the $\gamma$-function of $x$ itself

$$ \overline{\gamma_i(x, \langle w_i \rangle_{t_\infty})}^{\,\mu} = \kappa \qquad (i = 1, \ldots, N; \; \mu = 1, \ldots, p). \tag{42} $$

Since the margin parameter $\kappa$ is positive, the latter equation implies that, on the
average, the $\gamma_i$ are positive. We now recall that for a perfectly functioning network all
$\gamma_i$ should be positive. Therefore, by calculating the fraction of $\gamma_i$'s that are positive
in various cases, we can judge the quality of a neural network.
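
The performance measure described above, the fraction of positive stability coefficients, can be sketched in a few lines of Python. The function names `stability_coefficients` and `fraction_positive`, and the convention that `patterns` holds one binary pattern per row, are assumptions made here for illustration.

```python
import numpy as np

def stability_coefficients(patterns, W, theta):
    """gamma_i(x, w_i) of equation (3) for every pattern (row) and neuron (column)."""
    fields = patterns @ W.T - theta       # sum_l w_il x_l - theta_i, shape (p, N)
    return fields * (2 * patterns - 1)

def fraction_positive(patterns, W, theta):
    """Fraction of the p*N stability coefficients that are strictly positive."""
    return float(np.mean(stability_coefficients(patterns, W, theta) > 0))

# Toy example with N = 2 neurons and p = 2 patterns.
patterns = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
theta = np.zeros(2)
quality = fraction_positive(patterns, W, theta)
```

A value of `quality` equal to one corresponds to the perfectly functioning network mentioned above; for a tabula rasa (all weights and thresholds zero) all coefficients vanish and the fraction is zero.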



3.2. A useful probability distribution 

We now address the question whether there exists, after a certain number of learning
steps, a (roaming) fixed point for each cluster. An alternative way of putting this
question is to ask whether there exist $z^1, \ldots, z^p$ such that the $\gamma_i(z^\mu, w_i(t))$ are all
positive at a certain time $t$. We will investigate this question numerically. To that
end, we choose a particular form for the clusters $\Omega^\mu$ by specifying the choice of the
probability distribution $p^\mu(x)$, equations (22)-(25). We take for its $i$-th factor

$$ p_i^\mu(x_i) = (1 - b) \, \delta_{x_i, \xi_i^\mu} + b \, \delta_{x_i, 1 - \xi_i^\mu} \tag{43} $$

where $b$ is a parameter between 0 and 1, which we will refer to as the 'noise-parameter'.
If $b = 0$ (no noise), only the patterns $x = \xi^\mu$ have a non-zero probability of occurrence.
For values of $b$ close to zero any vector $x$ has a non-zero probability of occurrence,
but only vectors $x$ close to one of the $\xi^\mu$ have a probability of occurrence comparable
to the probability of occurrence of a typical pattern. The particular choice (43) for
$p^\mu(x)$ enables us to construct the collection of vectors $x$ to be used as learning input
vectors in our numerical calculation. Since in the derivation of the Master Equation
the clusters $\Omega^\mu$ have been chosen disjoint, a vector $x$ cannot belong to more than
one cluster. According to the probability distribution (43), however, a vector $x$ which
belongs to a certain cluster $\Omega^\mu$ has a very small, but non-zero, probability to
belong to any other cluster. This implies that (43) is not exact but only a very
good approximation to the actual situation, for which these probabilities vanish
exactly.

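The expression (43) can be used to calculate the average $\bar{x}_i^\mu$. Inserting (23) and
(43) into (25), and using the normalization (24), we obtain

$$ \bar{x}_i^\mu = \sum_{x_i = 0, 1} \big[ (1 - b) \, \delta_{x_i, \xi_i^\mu} + b \, \delta_{x_i, 1 - \xi_i^\mu} \big] x_i \prod_{k \neq i} \sum_{x_k = 0, 1} \big[ (1 - b) \, \delta_{x_k, \xi_k^\mu} + b \, \delta_{x_k, 1 - \xi_k^\mu} \big] \tag{44} $$

or

$$ \bar{x}_i^\mu = (1 - b) \, \xi_i^\mu + b \, (1 - \xi_i^\mu). \tag{45} $$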
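
The distribution (43) amounts to flipping each bit of a typical pattern independently with probability $b$. The following sketch draws such noisy patterns and compares the empirical mean of each component with $(1 - b)\xi_i^\mu + b(1 - \xi_i^\mu)$, the average that follows from (43). The function name `sample_noisy_pattern`, the example pattern `xi`, the value `b = 0.1` and the sample size are assumptions chosen for illustration.

```python
import numpy as np

def sample_noisy_pattern(xi, b, rng):
    """Draw x according to eq. (43): x_i = xi_i with probability 1 - b, flipped otherwise."""
    flips = rng.random(xi.shape) < b
    return np.where(flips, 1 - xi, xi)

rng = np.random.default_rng(1)
xi = np.array([1, 0, 0, 1, 1, 0, 0, 0])   # hypothetical typical pattern
b = 0.1                                   # hypothetical noise parameter

samples = np.array([sample_noisy_pattern(xi, b, rng) for _ in range(20000)])
empirical_mean = samples.mean(axis=0)
predicted_mean = (1 - b) * xi + b * (1 - xi)
```

For the values assumed here the predicted average is 0.9 for components with $\xi_i^\mu = 1$ and 0.1 for components with $\xi_i^\mu = 0$; the empirical mean should agree with this up to sampling fluctuations.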
For $b = 0$, the case of patterns without noise, $\bar{x}_i^\mu$ reduces to $\xi_i^\mu$. Using this fact in our
present main analytical result, given by equations (39)-(40), one indeed recovers the
old result (4) for the final values of the weights in case of noiseless input.

It is also instructive to compare the result for the weights $\langle w_{ij} \rangle_{t_\infty}$, equations (39),
(40) with (45), and the result of the preceding article [14]. In the latter article, we
calculated the weights of a recurrent neural net, denoted as $w_{ij}(t)$, in case fixed points
and basins of attraction are taken explicitly into account, see equations (1)-(2) of [14].
The only difference is found in the factor $2\bar{x}_i^\mu - 1$, which, in our result based upon
basins, reads $2\xi_i^\mu - 1$. In fact, we find for the difference of the preceding and present
results:

wa(t) - K-) too = (tf " ^Wr'rx" (46) 

where we used the expressions (1)-(2) of [14] and (39)-(40), together with the
definitions of $\gamma$ in the two cases. Furthermore, we put $\langle w_{ij} \rangle_{t_0} = w_{ij}(t_0)$ for the
initial values of the weights. Since

$$ \xi_i^\mu - \bar{x}_i^\mu = b \, (2\xi_i^\mu - 1) \tag{47} $$

the difference (46) is of the order $b$, i.e., small compared to the weights themselves.
Consequently, the biological system considered here is found to be able to realize the
optimal values $w_{ij}(t)$ for the weights derived earlier [14], up to terms that are small
compared to unity. This is intriguing, since, a priori, there is no reason to expect that
a biological learning rule based on economy of energy to rebuild a synapse [8] will lead
to values of the weights that are a good approximation to the values found from the
requirement that there are fixed points with prescribed basins [14].

Due to the fact that the final results of [14] and this article for the weights are
very similar, one may expect that natural, biological learning via the learning rule
(6) for noisy patterns $x \in \Omega^\mu$ ($\mu = 1, \ldots, p$) will lead to larger basins of attraction
than in case of learning of noiseless patterns.



3.3. Storage of noisy patterns

Having chosen the sets $\Omega^\mu$, via $p_i^\mu(x_i)$, eq. (43), we are able to simulate a learning
process with noisy patterns. What we will do in the numerical study below is to pick an
input vector $x$ according to a probability distribution as given by equation (43). Next,
we calculate the new synaptic weights $w_{ij}(t_{n+1})$, equation (1), using the energy saving
learning rule (6). Finally, we calculate the $pN$ stability coefficients $\gamma_i(z^\mu, w_i(t_{n+1}))$
($i = 1, \ldots, N$), where $z^\mu$ stands for the last learned pattern of $\Omega^\mu$ ($\mu = 1, \ldots, p$),
coefficients which we hope to be positive. The result is shown in figure 1.

The left and right columns in figure 1 correspond to two particular choices for
the factor $\eta_i$ occurring in the learning rule (6), namely the 'local' and 'global' learning
rules, which will be described in section 3.4. Going downwards in one of these two
columns, the number of learning steps rises. In each panel, we have plotted the number
of $\gamma$'s as a function of their value. It is seen that after 300 learning steps almost
all $\gamma$'s are positive. Hence, at the 300-th step, most of the last learned patterns $z^\mu$ are
fixed points indeed. We note that it is instructive to compare the results of learning
of patterns with and without noise: see figure 3 of Q. There is almost no difference
in case of the local learning rule, whereas noise seems to diminish a little bit the
effectiveness of the (non-biological) global learning rule.
Figure 1. Noisy patterns. The performance of a neural network is measured by the stability
coefficients $\gamma_i$ related to neuron $i$, which should be positive for a properly functioning neural net.
In the figure we have plotted averages of these $\gamma$'s, for series of noisy input patterns, for the local
learning rule (left column) and the global learning rule (right column). The number of learning
steps increases from 32 (top) via 160 and 320 to 640 (bottom). The average is taken over 100
sets of $p = 32$ patterns for a neural network with $N = 128$ neurons. The noise parameter which
yields the sets $\Omega^\mu$ is taken to be $b = 0.01$. The calculations have been performed starting with a
tabula rasa for the weights: $w_{ij}(t_0) = 0$, and for neurons with vanishing thresholds: $\theta_i = 0$. The
dilution in the network is $d = 0.2$, the average activity of the net is $a = 0.2$. The normalization
of the weights has been fixed such that $\kappa = 1$. Both in case of global and local learning, the
proposed learning rule (6) leads to a very satisfactory result: almost all $\gamma$'s become positive.

Figure 2. Fluctuations of the weight of a connection around its average. The
value of an arbitrarily chosen synaptic connection $w$ is followed during 20,000 learning steps,
for the local learning rule. A similar picture is obtained in case of the global learning rule. The
average value $\langle w \rangle_{t_\infty}$ as predicted by equations (39)-(40) is depicted as a horizontal dashed line.
It is observed that the value of the connection fluctuates around the dashed line. The same
parameters are chosen as in figure 1, i.e., $N = 128$, $p = 32$, $a = 0.2$, $b = 0.01$, $d = 0.2$, $\kappa = 1$,
$w_{ij}(t_0) = 0$ and $\theta_i = 0$ for all $i$ and $j$.



Finally, to illustrate the fact that the weights of the synaptic connections fluctuate
around the average value as given by the expression (39)-(40), we have plotted, in
figure 2, the time-evolution of the weight of an arbitrarily chosen connection together
with its average value.
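
The qualitative behaviour described here, weights that keep changing while remaining bounded, can be explored with a small simulation of a single neuron. This is a sketch under assumptions and not a reproduction of figure 2: the sizes `N = 32`, `p = 3`, the noise parameter `b = 0.1` and the activity `a = 0.5` are arbitrary choices, dilution is omitted, the threshold is set to zero, the self-connection is excluded from $V_i$, and the local learning rate $1/(Na)$ is used.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, a, b, kappa = 32, 3, 0.5, 0.1, 1.0   # hypothetical parameters
i = 0                                      # neuron whose incoming weights are followed

xi = (rng.random((p, N)) < a).astype(float)  # random typical patterns
mask_i = np.ones(N, dtype=bool)
mask_i[i] = False                            # assumed: no self-connection in V_i
eta = 1.0 / (N * a)                          # local learning rate

w_i = np.zeros(N)                            # tabula rasa
history, step_sizes = [], []
for n in range(5000):
    mu = rng.integers(p)                                  # pick a cluster
    x = np.where(rng.random(N) < b, 1 - xi[mu], xi[mu])   # noisy pattern, eq. (43)
    gamma = (w_i @ x) * (2 * x[i] - 1)                    # eq. (3) with theta_i = 0
    dw = eta * (kappa - gamma) * (2 * x[i] - 1) * x * mask_i  # eq. (6)
    w_i = w_i + dw
    history.append(w_i.copy())
    step_sizes.append(np.abs(dw).max())

history = np.array(history)
time_average = history[1000:].mean(axis=0)        # empirical long-time average
late_step_size = float(np.mean(step_sizes[-200:]))  # typical size of late increments
```

Plotting one column of `history` against the step number, together with the corresponding entry of `time_average`, gives a picture of the same kind as figure 2: the increments do not die out, but the weight wanders around a stable level.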

3.4. The learning rate $\eta_i$

So far, we have been concerned with learning, and problems related to learning. In
our study of the learning process, the weights $w_{ij}$ of the synaptic connections changed
according to the learning rule (6), in which a factor $\eta_i$ occurred, the so-called learning
rate, which, so far, was left unspecified. In this subsection, we focus the attention on
this factor $\eta_i$. In a preceding article, we showed that in a process of ideal learning,
i.e., such that the energy needed to change the synapses is minimal, the learning rate
$\eta_i$ was given by

$$ \eta_i = 1 \Big/ \sum_{k \in V_i} x_k. \tag{48} $$

Depending on neuron activity not restricted to two neurons only, this factor is non-local
and therefore biologically unrealistic (see [||, section 6). In a biological context,
it should be replaced by some local approximation, for instance, a constant like

$$ \eta_i = 1/(N a) \qquad (i = 1, \ldots, N) \tag{49} $$

where $a$ denotes what is called the mean-activity, i.e., it is the probability that an
arbitrary neuron $i$ is in the state +1. In figure 1 the left column of pictures corresponds
to the biological, local learning rule, i.e., equation (6) with $\eta_i$ given by (49), while the
right column corresponds to the global learning rule, i.e., equation (6) with $\eta_i$ given
by (48).
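
The difference between the two choices for the learning rate can be made explicit with a short sketch. With the global rate, one application of the rule (6) brings the stability coefficient of the presented pattern exactly to the margin $\kappa$, whatever the weights were before; this is the one-step storage property referred to in the text. The function names `global_rate` and `local_rate`, the example pattern and the random initial weights are assumptions made for illustration.

```python
import numpy as np

def global_rate(x, mask_i):
    """Global learning rate: inverse of the number of active neurons in V_i."""
    return 1.0 / np.sum(x[mask_i])

def local_rate(N, a):
    """Local learning rate: the constant 1 / (N a)."""
    return 1.0 / (N * a)

N, i, kappa, theta_i = 10, 0, 1.0, 0.0
rng = np.random.default_rng(2)
x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])  # hypothetical binary pattern
mask_i = np.ones(N, dtype=bool)
mask_i[i] = False                             # assumed: V_i excludes neuron i itself
w_i = rng.normal(size=N)                      # arbitrary weights before learning

gamma_before = (w_i @ x - theta_i) * (2 * x[i] - 1)
eta = global_rate(x, mask_i)
w_new = w_i + eta * (kappa - gamma_before) * (2 * x[i] - 1) * x * mask_i
gamma_after = (w_new @ x - theta_i) * (2 * x[i] - 1)   # equals kappa up to rounding
```

With the local rate, `local_rate(N, a)`, the same step moves the stability coefficient only part of the way towards $\kappa$ whenever the number of active neurons in $V_i$ is smaller than $Na$.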

3.5. Retrieval by a biological network with ever changing connections 

Is there more one can say on the value for the learning rate $\eta_i$ in case of a biological
network than that it should be an approximation to the value (48), which guarantees
that the process of learning takes place in an energetically most economical way? The
answer is affirmative in case one requires that the learning is good enough to store
patterns, but is not so good that it stores each learned pattern in only one learning
step, as is the case for the global learning rule, i.e., the learning rule with $\eta_i$ given by
equation (48). In other words, it will turn out in this section that it is advantageous
to learn via a learning rule that is not able to always store a new pattern in only one
learning step. The reason for this counter-intuitive requirement, which we are going
to impose, is that we demand that the network be able to
retrieve patterns and change connections at the same time.

In most models of neural networks one distinguishes between a learning phase and 
a retrieval phase. In the learning phase the weights are changed according to some 
rule, in the retrieval phase the weights are kept fixed. In a biological neural network 
such a separation of phases does not occur. Weights do not stop changing in the 
retrieval phase when a stimulus is presented, and this is precisely what is happening 
when a neural network has to recognize a pattern. If the change due to the stimulus
were too close to the 'ideal' value (48), the network would change in such a way
that every new pattern would immediately be learned and, hence, be recognized. This
is not what should happen: if every new pattern were stored immediately,
it could not easily be distinguished from a pattern that had been stored in the network
a long time ago. Therefore, we must require that $\eta_i$ differs sufficiently from
the value (48), which it has in case of the global learning rule. If we take $\eta_i$ larger
and larger with respect to the value (48), network changes become too large for
the network to function properly [Q. So we are left with the possibility that $\eta_i$ has
a value somewhere between zero and the value (48), which is large enough to store
patterns, and small enough for the network to distinguish between new and formerly
learned patterns.

The above qualitative statements should now be made quantitative. In figure 3
we consider the storage of one pattern. We have plotted the percentage of positive $\gamma$'s
as a function of the learning rate $\eta$. For $\eta$ in the range $(3/N, 11/N)$ all, or almost all,
$\gamma$'s are positive after one learning step. For $\eta \approx 1/N$, only 80% of the $\gamma$'s are positive
after one learning step. We conclude that the factor $\eta$ figuring in the
learning rule (||) should be of the order of $3/N$ or less. Such a value guarantees that
a biological network, which is bound to change its connections also during retrieval,
does not learn so fast that it recognizes patterns already after one learning step, as in
the case of the global learning rule (48).

In order to compare what happens when an already learned pattern is presented 
to the network with what happens in the net when a totally new pattern is the input, 
it is useful to define the overlap function of the two input patterns. The 'overlap', 
$Q(x, y)$, of two binary patterns $x$ and $y$ of $N$ bits, defined in the usual way, is given
by

$$Q(x, y) = \frac{1}{N} \sum_{i=1}^{N} (2x_i - 1)(2y_i - 1). \tag{50}$$

If $x_i = y_i$ for all $i$ ($i = 1, \ldots, N$), the overlap takes its maximal value $+1$; if $x_i = 1 - y_i$
for all $i$, the overlap takes its minimal value $-1$. In figure 4 we compare the functioning
of a neural network that changes its connections during the process of 'recognition'
of previously learned patterns (left column) and random, non-learned, patterns (right
column), for values of $\eta$ going down from $3/N$ (top) to $1/N$ (bottom). In the left
column we have plotted, vertically, the overlap $Q(x(t_n), z^\mu(t_n))$ of an arbitrary learned
pattern $x(t_n) \in \Omega^\mu$ that is presented to the net, and the last learned fixed point $z^\mu$.
Both may change permanently in time because of the continuous updating of the
weights. In the right column we have plotted, vertically, the overlap $Q(x(t_n), x(t_{n+1}))$
of an arbitrary pattern $x(t_n)$, not previously presented to the network, and the pattern
$x(t_{n+1})$ generated by the network one time step later. In both columns, time steps
are plotted along the horizontal axis.

Figure 3. Storage efficiency for local learning. The neural net has been initialized by
taking a tabula rasa for the weights, followed by some learning process. Then, the fraction of
positive $\gamma$'s, after one more learning step, has been plotted as a function of the learning rate
$\eta$. For a biological network with ever changing connections $\eta$ should not be too effective, i.e.,
not in the range $3/N$ to $11/N$. We took a neural net with $N = 512$ neurons and used 20 initial
learning steps. Furthermore, we took thresholds $\theta_i = 0$, dilution $d = 0.2$, activity $a = 0.2$ and
noise level $b = 0.01$.
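The overlap (50) is simple enough to check by hand. The following minimal sketch, in plain Python, evaluates it for 0/1 patterns and reproduces the two extreme cases mentioned in the text; the function name `overlap` and the example patterns are illustrative choices, not taken from the article.

```python
from typing import Sequence


def overlap(x: Sequence[int], y: Sequence[int]) -> float:
    """Overlap Q(x, y) of two binary (0/1) patterns, following equation (50)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("patterns must be non-empty and of equal length")
    # (2*x_i - 1) maps the states 0/1 onto -1/+1 before the product is taken.
    total = sum((2 * xi - 1) * (2 * yi - 1) for xi, yi in zip(x, y))
    return total / len(x)


x = [1, 0, 0, 1, 0]
x_flipped = [1 - xi for xi in x]  # every bit inverted

print(overlap(x, x))          # identical patterns: maximal overlap
print(overlap(x, x_flipped))  # complementary patterns: minimal overlap
```

Two patterns that agree on exactly half of their bits have overlap 0, which makes the quantity a convenient measure of how far a network state is from a stored fixed point.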

In our numerical study, the connections $w_{ij}(t_n)$ are changed according to the rule
(El) combined with the learning rule (Q); the patterns $x(t_n)$ are updated using the
usual dynamics [18, page 20]

$$x_i(t_{n+1}) = \Theta_H\Big(\sum_{j=1}^{N} w_{ij}(t_n)\, x_j(t_n) - \theta_i\Big) \qquad (i = 1, \ldots, N) \tag{51}$$

applied in parallel, i.e., at a time $t_n$ all neurons $i$ update their states $x_i(t_n)$
simultaneously. The learned patterns $x$ are chosen according to the probability $p^\mu(x)$
around $\xi^\mu$, equation (43). The arbitrary, non-learned, patterns $x$ are chosen randomly
with mean activity $a = 0.2$.
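A single parallel update of the kind described by (51) can be sketched as follows. The sketch makes several assumptions that are not stated in this section: the step function is taken to return 1 only for strictly positive argument, the weights are hypothetical random numbers held fixed during the step (in the article they change at every time step), and the names `heaviside`, `parallel_step` and `random_pattern` are introduced only for illustration.

```python
import random
from typing import List, Sequence


def heaviside(h: float) -> int:
    """Step function; returning 0 at h == 0 is an assumed convention."""
    return 1 if h > 0 else 0


def parallel_step(w: Sequence[Sequence[float]],
                  x: Sequence[int],
                  theta: Sequence[float]) -> List[int]:
    """One synchronous update: every neuron reads the same old state x."""
    n = len(x)
    return [
        heaviside(sum(w[i][j] * x[j] for j in range(n)) - theta[i])
        for i in range(n)
    ]


def random_pattern(n: int, activity: float, rng: random.Random) -> List[int]:
    """Random 0/1 pattern in which each neuron is active with probability `activity`."""
    return [1 if rng.random() < activity else 0 for _ in range(n)]


rng = random.Random(0)
n = 8
# Hypothetical weights, for illustration only; no self-connections.
w = [[0.0 if i == j else rng.uniform(-1.0, 1.0) for j in range(n)]
     for i in range(n)]
theta = [0.0] * n  # zero thresholds, as in the figure captions

x0 = random_pattern(n, 0.2, rng)
x1 = parallel_step(w, x0, theta)
print(x0, "->", x1)
```

Because the new state is built in a separate list, no neuron sees an already updated neighbour, which is what distinguishes parallel from sequential updating.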

In each of the four pictures of the left column learned patterns are presented to the
network, and followed during ten learning steps. In the top left picture no recognition
takes place, whereas for lower values of $\eta$ the recognition capability of the network
rises. In the bottom left picture recognition always takes place. In case recognition
takes place for an $x \in \Omega^\mu$, it is found that the fixed point $z^\mu$ does not change in time.
If, however, recognition does not occur, the fixed point $z^\mu$ was found to change in
time. Observe that in some cases the learned patterns seem to evolve to a two-state
attractor, in contrast to what one might expect. In fact, we showed in article || that
when a new pattern is learned with the global form of learning rule (^), one arrives at
a one-state attractor (fixed point) after one learning step: see the first new paragraph
of H under equation (43). Hence, we may expect that in the left column, where we
use a local learning rule that approximates the global one, only one-state attractors
would occur. The occurrence of two-state attractors can only be a consequence of the
fact that, in contrast to the treatment of Q, the weights always change in time, and/or
of the approximation of the global learning rule by a local learning rule.
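To make the distinction between one-state and two-state attractors concrete, the sketch below iterates a parallel threshold dynamics and reports whether the trajectory settles on a fixed point or alternates between two states. It is a diagnostic helper under simplifying assumptions, not the procedure used in the article: the weights are held fixed, the step function returns 0 at zero argument, and the names `parallel_step` and `attractor_period` are hypothetical.

```python
from typing import List, Optional, Sequence


def parallel_step(w: Sequence[Sequence[float]],
                  x: Sequence[int],
                  theta: Sequence[float]) -> List[int]:
    """One synchronous threshold update of all neurons (weights held fixed)."""
    n = len(x)
    return [
        1 if sum(w[i][j] * x[j] for j in range(n)) - theta[i] > 0 else 0
        for i in range(n)
    ]


def attractor_period(w: Sequence[Sequence[float]],
                     x0: Sequence[int],
                     theta: Sequence[float],
                     max_steps: int = 50) -> Optional[int]:
    """Return 1 for a one-state attractor, 2 for a two-state attractor,
    or None if neither is reached within max_steps updates."""
    x = list(x0)
    for _ in range(max_steps):
        x1 = parallel_step(w, x, theta)
        if x1 == x:
            return 1  # state reproduces itself: fixed point
        x2 = parallel_step(w, x1, theta)
        if x2 == x:
            return 2  # state returns after two steps: two-state attractor
        x = x1        # still in a transient, advance one step
    return None


# Toy two-neuron network in which each neuron excites the other.
w = [[0.0, 1.0], [1.0, 0.0]]
theta = [0.0, 0.0]

print(attractor_period(w, [1, 1], theta))  # both active: state is stable
print(attractor_period(w, [1, 0], theta))  # activity hops back and forth
```

In a simulation with changing weights, the same comparison of successive states could be applied to the recorded trajectory $x(t_n)$ instead of regenerating it from fixed weights.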

In case random patterns are the input to the network, there is in general no
evolution of the network to one of the fixed points $z^\mu$: a numerical study of the
overlap of a random pattern and any of the $z^\mu$ turned out to yield an overlap which
was always less than 1. What we have pictured in the right column of figure 4 is the
evolution of a random pattern during 20 time steps. For all values of $\eta$ a random
pattern evolves to a pattern that remains stable or almost stable under the network
dynamics. In related cases (see, e.g., [17] and [18, section 4.1]) these 'spurious states'
are found to vanish when the dynamics of the network is taken to be stochastic rather
than deterministic.

Figure 4. Retrieval of learned and random patterns. In each of the pictures at the
left, a learned pattern is presented six times to a network, for $\eta = 3/N$, $2/N$, $1.5/N$ and $1/N$
(top to bottom). The recognition is preceded by a learning stage which took 300 learning steps.
A pattern is recognized when the overlap is 1. In each of the pictures at the right, a random
pattern is presented five times to the network, after a learning period of the same number of
300 time steps. Again $\eta$ varied from $3/N$ to $1/N$ (top to bottom). A pattern is seen to evolve
almost always to some other, stable, pattern, for all values of $\eta$, since the overlap with the
preceding pattern almost always tends to 1. The network considered had the following
properties: number of neurons $N = 512$, number of patterns $p = 16$, mean activity $a = 0.2$,
dilution $d = 0.2$, noise parameter $b = 0.05$. The parameter $\eta$ occurs in the learning rule (^)
used to perform the updating of the weights of the connections.

Not only in case of learned patterns but also in case of random patterns, fixed
points $z^\mu$ have been found which change in time. This is due to the fact that the
weights change continuously in time, or to the approximation of the global learning
rule by a local learning rule. We do not pursue this and other points related to figure 4
any further.



4. Conclusions 

The basis of this article is the learning rule for noisy patterns, equation (||). We
found, by a numerical study of this learning rule, that storage of noisy patterns leads
to fixed points that move around in collections $\Omega^\mu$ that are representative of the noisy
patterns. An analytical study of the same learning rule reveals that the weights found
via this rule fluctuate, as long as learning or retrieval of patterns takes place, around
certain average values, for which the explicit expressions (39)-(40) could be
derived.

In the limit of vanishing noise in the input, we recover the expressions for the 
weights obtained earlier || on the basis of a totally different approach, namely, 
economy of energy in case of synaptic change. This is satisfactory, because it yields 
an independent check of their correctness. 

A comparison with other results obtained earlier [14], in which we determined
the optimal weights for a neural net with prescribed basins of attraction, shows that
the biological updating rule of the present article, eq. @, realizes the latter results
via eqs. (39)-(40) up to terms of the order of the noise parameter $b$, which are small
compared to one.